Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

Authors

  • Ralf Steinberger
  • Sylvia Ombuya
  • Mijail A. Kabadjov
  • Bruno Pouliquen
  • Leonida Della Rocca
  • Jenya Belyaeva
  • Monica de Paola
  • Camelia Ignat
  • Erik Van der Goot

Abstract

The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news, currently in fifty languages, and that extract named entities and quotations (reported speech) in twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM's design is entirely modular, so a new language can be plugged in by providing its language-specific resources. We thus describe the types of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications pursued in our efforts include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, and the identification of reported speech quotations by and about people.

Similar resources

Challenges and methods for multilingual text mining

Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Sel...

A survey of methods to ease the development of highly multilingual text mining applications

Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Sel...

Multilingual Media Monitoring and Text Analysis - Challenges for Highly Inflected Languages

We present the highly multilingual news analysis system Europe Media Monitor (EMM), which gathers an average of 175,000 online news articles per day in tens of languages, categorises the news items and extracts named entities and various other information from them. We also give an overview of EMM’s text mining tool set, focusing on the issue of how the software deals with highly inflected lang...

Multilingual Information Extraction with PolyglotIE

We present POLYGLOTIE, a web-based tool for developing extractors that perform Information Extraction (IE) over multilingual data. Our tool has two core features: First, it allows users to develop extractors against a unified abstraction that is shared across a large set of natural languages. This means that an extractor needs only be created once for one language, but will then run on multilin...

Transculturation and Multilingual Lives: Writing between Languages and Cultures

This paper looks at the issues of transculturation as explored in auto and semi-autobiographical accounts of linguistic and cultural transitions. The paper also addresses a number of questions about the structure of these texts, the authors’ linguistic competences, as well as questions about the theoretical and conceptual tool which may help us to discuss the issues the writers are reflecting o...

Journal:
  • Language Resources and Evaluation

Volume 45

Publication date: 2011